Disjoint sets on Apache Spark
Problem description
I am trying to find an algorithm for computing disjoint sets (connected components / union-find) on a large amount of data with Apache Spark. The problem is the amount of data. Even the raw representation of the graph's vertices does not fit into RAM on a single machine, and neither do the edges.

The source data is a text file of graph edges on HDFS, one edge per line: "id1 \t id2". Each id is a string value, not an int.

The naive solution that I found is:

1. Take the RDD of edges -> [id1:id2] [id3:id4] [id1:id3]
2. Group the edges by key -> [id1:[id2;id3]][id3:[id4]]
3. For each record, assign the minimum id to every member of the group (flatMap) -> [id1:id1][id2:id1][id3:id1][id3:id3][id4:id3]
4. Reverse the RDD from step 3: [id2:id1] -> [id1:id2]
5. leftOuterJoin the RDDs from steps 3 and 4
6. Repeat from step 2 until the size of the RDD from step 3 stops changing

But this results in the transfer of large amounts of data between nodes (shuffling).

Any advice?

Recommended answer

If you are working with graphs, I would suggest that you take a look at either one of these libraries:

- GraphX
- GraphFrames

They both provide the connected components algorithm out of the box.

GraphX:

    // VD and ED stand for your vertex and edge attribute types
    val graph: Graph[VD, ED] = ...
    val cc = graph.connectedComponents().vertices

GraphFrames:

    val graph: GraphFrame = ...
    val cc = graph.connectedComponents.run()
    cc.select("id", "component").orderBy("component").show()

Other recommended answer

In addition to @Marsellus Wallace's answer, below is the full code to get disjoint sets from an RDD of edges using GraphX.

    import org.apache.spark.graphx.Graph
    import org.apache.spark.rdd.RDD

    val edges: RDD[(Long, Long)] = ???
    // -1L is the default attribute given to every vertex
    val g = Graph.fromEdgeTuples(edges, -1L)
    val disjointSets: RDD[Iterable[Long]] = g.connectedComponents()
      // Tuples of (vertexId, componentId); the component id is the
      // smallest vertex id in that component
      .vertices
      // Group by component id so that each group is one disjoint set
      .groupBy(_._2)
      .values
      // Keep only the vertex ids of each set
      .map(_.map(_._1))
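The question states that the ids are strings, while GraphX vertex ids are Long values, so the GraphX code cannot be applied to the raw edges directly. The following is a minimal sketch of one way to bridge that gap, not something taken from the answers: it gives every distinct string id a unique Long with zipWithUniqueId, runs connectedComponents, and maps the result back to the original strings. The names StringIdComponents, components, and rawEdges are hypothetical, and the sketch assumes the edge file has already been parsed into an RDD of string pairs.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of GraphX.
object StringIdComponents {

  // Returns (original string id, component id) for every vertex.
  def components(rawEdges: RDD[(String, String)]): RDD[(String, Long)] = {
    // Give every distinct string id a unique Long to use as the vertex id.
    val ids: RDD[(String, Long)] = rawEdges
      .flatMap { case (a, b) => Seq(a, b) }
      .distinct()
      .zipWithUniqueId()
      .cache()

    // Translate both endpoints of every edge; each join is a shuffle.
    val edges: RDD[(Long, Long)] = rawEdges
      .join(ids)                                  // (src, (dst, srcId))
      .map { case (_, (dst, srcId)) => (dst, srcId) }
      .join(ids)                                  // (dst, (srcId, dstId))
      .map { case (_, (srcId, dstId)) => (srcId, dstId) }

    // (vertexId, componentId) pairs, as in the GraphX answer.
    val cc = Graph.fromEdgeTuples(edges, -1L).connectedComponents().vertices

    // Map the Long vertex ids back to the original strings.
    ids.map(_.swap).join(cc).values
  }
}
```

Note that the id translation itself adds three joins, so it does not remove the shuffling the question worries about; it only makes string-keyed edges usable with GraphX. If the component members are needed as sets, the result can be grouped by its second field in the same way as in the GraphX answer.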